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Abstract 

Background: Gene set analysis is moving towards considering pathway topology as a crucial feature. Pathway 
elements are complex entities such as protein complexes, gene family members and chemical compounds. The 
conversion of pathway topology to a gene/protein networks (where nodes are a simple element like a gene/ 
protein) is a critical and challenging task that enables topology-based gene set analyses. 

Unfortunately, currently available R/Bioconductor packages provide pathway networks only from single databases. 
They do not propagate signals through chemical compounds and do not differentiate between complexes and 
gene families. 

Results: Here we present graphite, a Bioconductor package addressing these issues. Pathway information from 
four different databases is interpreted following specific biologically-driven rules that allow the reconstruction of 
gene-gene networks taking into account protein complexes, gene families and sensibly removing chemical 
compounds from the final graphs. The resulting networks represent a uniform resource for pathway analyses. 
Indeed, graphite provides easy access to three recently proposed topological methods. The graphite package is 
available as part of the Bioconductor software suite. 

Conclusions: graphite is an innovative package able to gather and make easily available the contents of the 
four major pathway databases. In the field of topological analysis graphite acts as a provider of biological 
information by reducing the pathway complexity considering the biological meaning of the pathway elements. 



1 Background 

A great deal of effort has been recently directed towards 
the study of gene sets (hereafter GSA) in the context of 
microarray data analysis. The aim is to identify groups 
of functionally related genes with possibly moderate, but 
coordinated, expression changes. Several GSA tests, 
both univariate and multivariate, have been recently 
developed. See [1] for a comprehensive review, and [2-4] 
for a detailed description and a critical investigation of 
the tested hypotheses. 

These approaches, although effective, miss the informa- 
tion of the topological properties of the pathways. To this 
end, the seminal paper by Draghici et al. [5] proposed a 
radically different approach (called impact analysis, 
SPIA) attempting to capture several aspects of the data: 
the fold change of differentially expressed genes (DEGs), 



* Correspondence: chiara.romualdi@unipd.it 
f Contributed equally 

department of Biology, University of Padova, via U. Bassi 58/B, Padova, Italy 
Full list of author information is available at the end of the article 

(3 BioMed Central 



the pathway enrichment and the topology of signaling 
pathways. In particular, SPIA enhances the impact of a 
pathway if the DEGs tend to lie near its entry points. 
Massa et al. [6] introduced an alternative approach that 
is based on a correlation structure test. Specifically, the 
graphical model theory is used to decompose the overall 
pathway into smaller cliques, with the aim of exploring in 
detail small portions of the entire model. Recently, Isci et 
al. [7] proposed a Bayesian Pathway Analysis that models 
each biological pathway as a Bayesian network (BN) and 
considers the degree to which observed experimental 
data fits the model. Finally, Laurent et al. [8] developed a 
graph-structured two-sample test of means for problems 
in which the distribution shift is assumed to be smooth 
on a given graph. 

In this perspective the retrieval of pathway informa- 
tion and the subsequent conversion into a gene/protein 
network is crucial. However, pathway annotations com- 
prise a myriad of interactions, reactions, and regulations 
which is often too rich for the conversion to a network. 
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In particular, challenges are posed by the presence of 
chemical compounds mediating interactions and by dif- 
ferent types of gene groups (e.g. protein complexes or 
gene families) that are usually represented as single 
nodes. Available R packages (KEGGgraph, [9] and 
NCIpath) share some drawbacks: i) they are focused 
on a single pathway database each; ii) they do not con- 
sider gene connections through chemical compounds; 
iii) they do not handle differently the various kinds of 
biological gene groups. 

Here we present graphite (GRAPH Interaction 
from pathway Topological Environment) a Bioconductor 
package that provides networks from the pathways of 
four databases (Biocarta; KEGG, [10]; NCI/Nature Path- 
way Interaction Database, [11]; Reactome, [12]). It dis- 
criminates between different types of biological gene 
groups; propagates gene connections through chemical 
compounds; allows the selection of edges by type of 
interaction; uniformly converts heterogeneous node IDs 
to EntrezGene IDs and HUGO symbols; and finally 
allows the user to directly run SPIA, DEGraph, and 
topologyGSA analyses over the provided networks. 

2 Implementation 

graphite was implemented using the statistical program- 
ming language R and the package is included in the 
open-source Bioconductor project [13]. In section 2.1 
we report a brief state of the art of pathway formats, 
databases and tools, while in section 2.2 we report the 
rules that graphite uses to convert pathway topology 
to gene networks. 

2.1 Pathways Background 

A variety of databases containing information on cell 
signaling pathways have been developed in conjunction 
with methodologies to access and analyse the data [14]. 
Pathway databases serve as repositories of current 
knowledge on cell signaling. They present pathways in a 
graphical format comparable to the representation pre- 
sent in text books, as well as in standard formats allow- 
ing the exchange between different software platforms 
and further processing by network analysis, visualization 
and modeling tools. At the present day, there exist a 
vast variety of databases containing biochemical reac- 
tions, such as signaling pathways or protein-protein 
interactions. The Pathguide resource serves as a good 
overview of current pathway databases [15]. It lists more 
than 200 pathway repositories; over 60 of those are spe- 
cialized on reactions of the human species. However, 
only half of them provide pathways and reactions in 
computer-readable formats needed for automatic retrie- 
val and processing, such as Biological Pathway Exchange 
(BioPAX, [16]), Systems Biology Markup Language 
(SBML, [17]) and Biological Connection Markup 



Language (BCML, [18]). Thus, different databases are 
characterised by different annotations and only a part of 
the whole set of reactions are confirmed by all the repo- 
sitories. On the other hand, Cerami et al. [19] have 
recently developed a web repository aiming at collecting 
and integrating all public pathway data available in stan- 
dard formats. It currently contains data from nine data- 
bases with over 1400 pathways and 687,000 interactions. 

From the graphical point of view, a number of soft- 
ware tools [20-28] have been developed to visually build 
computable models of pathways. For additional details 
see the web page http://www.sbgn.org/. These tools are 
usually based on graphical models in which nodes repre- 
sent genes, proteins or chemical compounds, and edges 
represent various types of interactions or associations. 

In order to gather curated pathway data we collect 
pathway information from the three public databases 
that have emerged as reference points for the system 
biology community. Reactome [12] (data was retrieved 
in the BioPax format from the Reactome web site), 
backed by the EBI, is one of the most complete reposi- 
tories; it is frequently updated and provides a semanti- 
cally rich description of each pathway. KEGG Pathways 
[10] (retrieved in KGML format) provides maps for both 
signalling and metabolic pathways, supplemented by 19 
highly interconnected databases with genomic, chemical 
and phenotypic information. BioCarta (http://www.bio- 
carta.com) and NCI (NCI/Nature Pathway Interaction 
Database) [11] whose data were retrieved in BioPax for- 
mat from the PDI database web page. 

2.2 Pathway topology conversion to gene network 

Pathway topologies converted to gene-gene networks 
(simple interaction format, hereafter SIF) is available 
either on the Reactome [12] or on Pathway Commons 
[19] web sites. However, they provide a single huge file 
of protein-protein reaction for each database. 

Unfortunately topological pathway analysis (TPA) 
methods are not based on the analysis of the interaction 
network as a whole, but needs a separate graph for each 
pathway, in order to identify those that are significantly 
involved in the biological problem under investigation. 
Moreover, TPA benefits of long paths in which the gene 
signal can spread across compounds and cell compart- 
ments. Paxtools, a Java library for working with BioPAX 
(http://www.biopax.org/paxtools.php), defines some 
rules to convert a BioPax file to a SIF. However, it does 
not take into account signal compound propagation and 
gene group expansion. We start from Paxtools rules and 
we extend them in order to reduce both BioPAX and 
KGML interactions to pairwise relationships with com- 
pound mediated signal propagation. Table 1 and Figure 
1 report respectively pathway summary statistics and 
nodes/edges distributions for the four databases 
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Table 1 Number of pathways converted to networks with 
average number of edges and nodes according to the 
selected database. 



Database 


N. of 


Mean (Median) 


Mean (Median) 




pathways 


nodes 


edges 


KEGG 


232 


71.86 (54) 


211.12 (75.5) 


Reactome 


1070 


33.22 (14) 


780.64 (33) 


BioCarta 


254 


15.18 (14) 


36.88 (28) 


NCI 


177 


76.79 (48) 


165.18 (81) 



obtained after the conversion. In the following we 
describe in detail some of our rules. 

2.2. 1 Pathway definition 

The KEGG database provides separate xml files, one for 
each pathway. In the the other databases, we consider as 
a pathway every instance of the BioPax class pathway. 

2.2.2 A/odes with multiple elements 

Within a pathway, nodes often correspond to multiple 
gene products. These can be divided into protein com- 
plexes (proteins linked by protein-protein interactions) 
and groups containing alternative members (like gene 
families, genes with similar biochemical functions). These 
groups should be considered differently: the first kind 
(hereafter group AND) should be expanded into a clique 
(all proteins connected to the others), while the second 
(hereafter group OR) should be expanded without con- 
nections among them see Figure 2 (panel A and B). 

In the KGML format there are two ways of defining 
nodes with multiple elements: protein complexes (group 
AND defined by entry type = "group", see Figure 2A) 
and groups with alternative members (group OR defined 
by entry type = "gene", see Figure 2B). An example of 
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Figure 1 Edges and nodes distribution of networks after 
pathway conversion according to the selected database. 
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Figure 2 Toy examples of nodes with multiple elements 
converted to gene-network. Group AND (protein complexes, 
Panel A), group OR (member of gene family, Panel B) and 
compound mediated signal (panel C). 



the KGML for AND and OR groups can be seen in Addi- 
tional file 1. 

BioPax allows only one type of group: protein com- 
plexes (group AND) corresponding to the complex 
class. However, it often happens that a protein 
instance contains multiple xref s pointing to alternative 
elements of the process (group OR). An example of Bio- 
Pax definitions for both the AND and the OR groups 
can be seen in Additional file 1. 
2.2.3 Compound mediated interactions 
Compound-mediated interactions are interactions for 
which a compound acts as a bridge between two elements 
(see Figure 2C). As chemical compounds are not usually 
measured with high-throughput technologies, they should 
be removed from the network. However, the trivial elimi- 
nation of the compounds will strongly bias the topology 
interrupting the signals passing through them. If element A 
is linked to compound c and compound c is linked to ele- 
ment B, then A should be linked to B. Moreover, to best fit 
the biological model we take into account cell compart- 
ment membership: the connection among genes A and B is 
established only if the shared compound c has the same 
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localization in both the reactions. However, there are some 
chemical compounds that are highly frequent in map 
description (such as hydrogen, H 2 0,..). Signal propagation 
through them would lead to degenerate and long chains of 
compounds. To remove this artifact, we decided to ignore 
these compounds during the signal propagation. After par- 
sing all the BioPax and KGML data we obtain compound 
chains whose distribution are reported in Table 2. 

Within the KGML format there are two different ways 
of describing a compound mediated interaction: i) direct 
interaction type = "PPrel" (A interacts to B through 
compound c) and ii) indirect one type = "PCrel" {A 
interacts to compound c and c interacts to B). 

The BioPax format, on the other hand, provides only 
an indirect way of defining compound mediated signals 
(see Additional file 1). 
2.2.4 Relation attributes 

graphite allows the user to see the single/multiple 
relation types that characterize an edge. The edge types 
have been kept as close as possible to those annotated in 
the original formats. Some new types have been intro- 
duced due to the needs of the topological conversion. 

3 Results and discussion 

In sections 3.1 and 3.2 we provide two examples of 
pathway topologies converted to gene networks by 
graphite, while in section 3.3 we show the core func- 
tions to retrieve, convert and display graphite net- 
works. In section 3.4 we perform a simulation study to 
verify the efficacy of our signal propagation strategy in 
terms of topological analyses. In section 3.5 we run an 
example of topological gene set analysis using gra- 
phite networks and in section 3.6 we critically com- 
pare graphite with other available R/Bioconductor 
packages providing pathway topologies. 

3.1 Pathway conversion example 1: Insulin 
signaling pathway 

Figure 3 represent an example in which the simple elim- 
ination of compounds leads to an incorrect network 
topology. 

Table 2 Frequency of compound chains that we 
propagate according to different databases. 

Chain length KEGG Reactome Biocarta NCI 
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Figure 3 Differences in signal reconstruction of a selected 
portion of the insulin signaling pathway of KEGG (hsa04910). 

Panel A. The original signal cascade. Panel B. graphite signal 
reconstruction through chemical compound propagation. Numbers 
represent EntrezGene IDs. Panel C. KEGGgraph signal 
reconstruction. 



Insulin is an hormone controlling the balance between 
mobilization and storage of energy molecules. Insulin 
binds the Insulin Receptor (IR) and through phosphori- 
lation of the IRS adaptors is able to recruit and activate 
PI3K. PI3K is a kinase that converts PIP2 in PIP3 which 
is a secondary messenger involved in the regulation of 
various processes. The conversion between PIP3 into PI 
(3,4)P2 or PI(4,5)P2 operated by phosphatases like 
SHIP1/2 or PTEN induce a depletion of PIP3 levels and 
of consequence a reduced activity on its downstream 
targets [29]. 

PIP3 associates with the inner lipid bilayer of the 
plasma membrane to promote the recruitment of pro- 
teins with pleckstrin homology (PH) domains, like 
PDPK and AKT, which is a crucial mediator of various 
cell process, such as apoptosis, cell cycle, protein synth- 
esis, regulation of metabolism [30]. 

Among other functions, AKT activates also the cyclic 
nucleotide phosphodiesterases (PDEs), that is a group of 
enzymes able to regulate the localization, duration, and 
amplitude of the cyclic nucleotides. Signaling PDEs are 
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therefore important regulators of signal transduction 
mediated by these second messenger molecules [31]. In 
this pathway, PDE, depleting cAMP, indirectly inhibits 
the PKC mediated phosphorilation, and the activation of 
LIPE that is a lipase able to mobilize lipid energy stores. 
PDE acts, in this way, as a anti-lipolytic agents [32]. 

This hormonal mediated signaling cascade, from the 
insulin receptor to the inhibition of HSL, involves two 
"second messenger" compounds (PIP3 and cAMP) cru- 
cial for the transduction of the signal. 

In panel A of Figure 3 we report a part of the insulin 
signaling pathway of KEGG (hsa4910) that contains 
three groups OR (PDE3, AKT and PKA), and two com- 
pound mediated interactions (through PIP3 and cAMP). 
This is a clear examples of a signal cascade in which the 
propagation of the signal through compounds is crucial 
to keep the whole signaling path. In panel B we report 
graphite reconstructed signal cascade while in panel 
C the KEGGgraph partially reconstructed signal. 

An extract of the xml file corresponding to the signal 
reported in Figure 3 of the main text is present in Addi- 
tional file 1. From the xml definition it is evident how 
entry 2 (SKIP) and entry 3 (SHIP) are linked to com- 
pound 15 (PIP3) while there is no direct interaction 
between compound 15 (PIP3) and entry 62 (PDK1/2). 
This is why KEGGgraph misses the signal, while gra- 
phite captures it by splitting the relation between 
entry 52 (protein complex P13K) and 62 (PDK1/2) 
through compound 15 (PIP3) into both 52 to 15 and 15 
to 62. This dissection allows the reconstruction of the 
signal, otherwise impossible. 

3.2 Pathway conversion example 2: catalysis and cleavage 
of Notch 1 by Gamma Secretase Complex 

We selected the reaction 1784.3 from the Reactome 
pathway called "A third proteolytic cleavage releases 
NICD". Gamma secretase is a multi-subunit protease 
complex, itself an integral membrane protein, that 
cleaves single-pass transmembrane proteins at residues 
within the transmembrane domain. Here represented 
the processing of the Notch 1 protein. The gamma- 
secretase complex is composed of Presenilin homodimer 
(PSEN1 variant 1 or 2 or 3 or 4 or 5 and PSEN2 variant 
1 or 2), Nicastrin (NCSTN variant 1 or variant 2), 
APH1 (APH1A or APH1B) and PEN2. Maturation of 
the Notch receptor involves a cleavage of the protein, 
the intracellular domain is liberated from the plasma 
membrane that can enter into the nucleus to engage 
other DNA-binding proteins regulating gene expression. 
The cleavage is catalyzed and performed by Gamma 
Secretase Complex. 

Figure 4 shows Reactome representation of the reac- 
tions (Panel A), the BioPax information as it is stored in 
owl model and in Cytoscape plug-in for BioPax 



(respectively panel B and C) and the graphite final 
network (panel D). In the graphite network the 
nodes are annotated using the XRefs informations while 
edges preserve the type of the reaction annotated the 
OWL model. Distinction between OR complexes 
(formed by all the possibile variants of each protein) 
nested inside the AND complex of the Gamma secretase 
are topologically preserved in the resulting graph. 

3.3 graphite functions 

To access the Reactome, KEGG, Biocarta and NCI data- 
bases graphite uses respectively the lists reactome, 
kegg, biocarta and nci. A pathway network can be 
retrieved from one of the lists using the name of the 
pathway, 

> names (biocarta) [1:3] 

[1] "acetylation and deacetylat ion of 
rela in nucleus" 

[2] "actions of nitric oxide in the heart" 
[3] "activation of camp-dependent pro- 
tein kinase pka" 

> biocarta [ [" ras signaling pathway"] ] 
"ras signaling pathway" pathway from 

BioCarta 
Number of nodes = 18 
Number of edges = 22 
Type of identifiers = native 
Retrieved on = 2011-05-12 
or its position in the list of pathways: 

> p <- biocarta [ [175] ] 

> p@title 

[1] "ras signaling pathway" 

In the network, nodes represent genes and edges func- 
tional relationships among them. Nodes can have het- 
erogeneous IDs (according to the pathway original 
annotation) and edges can be characterized by multiple 
functional relationships. 

The function pathwayGraph builds a graphNEL 
object from a pathway p: 

> g < - pathwayGraph (p) 

> g 

A graphNEL graph with directed edges 
Number of Nodes = 18 
Number of Edges = 23 

> edgeData (g) [1] 

$ ^EntrezGene: 10928 | EntrezGene : 5879 ' 
$ 'EntrezGene: 10928 | EntrezGene : 58 7 9 
x $weight 
[1] 1 

$ x EntrezGene : 10928 | EntrezGene : 5879 

v $edgeType 

[1] "catalysisOut (ACTIVATION)" 
The function converter Identifiers allows the 

user to map such variety of IDs to a single type (Entrez- 
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Figure 4 Catalysis and cleavage of Notch 1 by Gamma Secretase Complex. Reactome representation of the reactions (Panel A), BioPax 
information as it is stored in owl model and in Cytoscape plug-in BioPax dedicated (respectively panel B and C) and the graphite final 
network (panel D). 



Gene or Gene Symbol). For the ID conversion gra- 
phite uses the data provided by the Bioconductor 
package org . Hs . eg . db. This mapping process, how- 
ever, may lead to the loss of some nodes (not all identi- 
fiers may be recognized) and has an impact on the 
topology of the network (one ID may correspond to 
multiple IDs in another annotation or vice versa). 



> pEntrez <- convert Identifiers (p, 
"entrez" ) 

> pEntrez 

"ras signaling pathway" pathway from 
BioCarta 
Number of nodes = 20 
Number of edges = 20 
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Type of identifiers = Entrez Gene 
Retrieved on = 2011-05-12 
> nodes (pEntrez) [1:10] 

[1] "10928" "1147" "3265" "387" "4303" 
"5295" "572" "5879" "5894" "5898" 

Several pathways have a huge number of nodes and 
edges, thus there is the need of an efficient system of 
visualization. To this end graphite uses the Rcytos- 
cape package to export the network to Cytoscape [27]. 
Cytoscape is a Java based software specifically built to 
manage biological network complexity and for this rea- 
son it is widely used by the biological community. Run 
Cytoscape with RPC plugin enabled and type at the R 
command prompt: 

> cytoscapePlot (convert Identifiers (a 
$' toll -like receptor pathway' , "symbol") ) 

The network will be automatically loaded into Cytos- 
cape. See Figure 5 for the result of this operation. 

3.4 Simulation study: compound propagated signal 
improves topological analysis 

In order to verify our signal propagation strategy we 
perform a simulation study. Using the insulin signaling 
pathway of the KEGG database we select as differentially 
expressed 22 genes lying on the signal paths highlighted 
in Figure 6A. These genes are connected if propagation 
is employed, otherwise they are disconnected (see Figure 
6C and 6D for propagation and non-propagation respec- 
tively). We expect that propagation will lead to better 
results in terms of topological analyses. 

Our simulation is based on the following steps: 1) we 
randomly generate \a fc ~ U (2, 10); 2) we randomly gen- 
erate log fold change values [S t for i = 1,..., 22) of the 
differentially expressed genes as S t ~ N (^ FC , 2) (interac- 
tions of the signal paths selected are characterized all by 
activation, thus, fold changes have the same sign); 3) we 
run the SPIA[5] algorithm on the Insulin signaling 
pathway with and without signal propagations and we 




Figure 5 Visualization of graphite network using RCytoscape 
package. 

V / 



take the p-value of the topological analysis (pPERT); 4) 
we repeat from step 1 10,000 times. 

As shown in Figure 6B the distribution of the topolo- 
gical significance p-values in case of signal propagation 
is shifted towards lower values with respect to the case 
of non-propagation. Propagation p-value distribution is 
not only centered on 0.1 (while the one with non-propa- 
gation is centered on 0.3) but is also less variable. As 
expected the same results are obtained simulating nega- 
tive fold changes (data not shown). This finding demon- 
strate that compound mediating signal propagation 
improves topological analyses giving more reliable 
results. 

3.5 Example of topological analysis: B-lineage Adult Acute 
Lymphocytic Leukemia 
3.5.7 Data 

The dataset, recently published by [33], characterizes 
gene expression signatures in acute lymphocytic leuke- 
mia (ALL) cells associated with known genotypic 
abnormalities in adult patients. Several distinct genetic 
mechanisms lead to acute lymphocytic leukemia (ALL) 
malignant transformations deriving from distinct lym- 
phoid precursor cells that have been committed to 
either T-lineage or B-lineage differentiation. Chromo- 
some translocations and molecular rearrangements are 
common events in B-lineage ALL and reflect distinct 
mechanisms of transformation. The relative frequencies 
of specific molecular rearrangements differ in children 
and adults with B-lineage ALL. The BCR breakpoint 
cluster region and the c-abl oncogene 1 (BCR/ABL) 
gene rearrangement occurs in about 25% of cases in 
adult ALL, and much less frequently in pediatric ALL. 

Data is available at the Bioconductor site (http://www. 
bioconductor.org/help/publications/2003/Chiaretti/chiar- 
etti2/) Expression values, appropriately normalized 
according to rma and quantile normalization, derived 
from Affymetrix single channel technology, consist of 37 
observations from one experimental condition (n x = 37, 
BCR; presence of BCR/ABL gene rearrangement) and 41 
observations from another experimental condition (n 2 - 
41, NEG; absence of rearrangement). Probes platform 
have been annotate using EntrezGene custom CDF ver- 
sion 14 [34]. Given the involvement of BCR and ABL 
genes in the chimera rearrangement, we expect these 
genes playing a central role in the gene set analysis; 
thus, most of the pathways containing BCR and/or ABL 
genes should be found as significant. 
3.5.2 Results 

We report the results obtained by SPIA[5] and topo- 
logyGSA[6] on the graphite networks. These statis- 
tical tests are based on completely different null 
hypotheses; while SPIA needs the list of differentially 
expressed genes, topologyGSA performs two statistical 
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without propagation 
- with propagation 




Figure 6 Results of the simulation study on the Insulin signaling pathway compound mediated signal propagation. Panel A. Signal 
paths selected to be differentially expressed. Panel B. p-value distribution of the topological analysis spia (pPERT) with and without 
propagation. Panel C. graphite network obtained from insulin pathway with propagation. Panel D. network obtained from insulin pathway 
without propagation. 



tests (to compare the mean and the variance of the 
pathway between two groups) on the entire list of genes 
belonging to a pathway. Here, differentially expressed 
genes required for SPIA package have been identified 
using RankProd test [35] (FDR <0.01), while the test 
on the mean has been chosen for topologyGSA 
package. 



Tables 3 and 4 reports the list of significant pathways 
identified by the above approaches; pathways marked 
with S are those containing BCR and/or ABL genes. It 
is interesting to observe that several pathways contain- 
ing either BCR and ABL genes were identified as 
deregulated especially with topologyGSA. Then, as 
expected, several additional pathways associated to 
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Table 3 Pathway analysis performed using SPIA statistical test on graphite networks. 



Name FDR Signal Database BCR ABL 
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cancer progression, apoptosis, cell cycle, cell prolifera- 
tion and inflammation have been selected as significant. 

Leaving the comparison between topological analyses 
aside (because it is out of the scope of the present 
work), the results testify the feasibility of performing 
analyses using graphite and the ability to obtain reli- 
able results independently of the chosen analysis 
method. In addition, for the first time, thanks to gra- 
phite all the topological methods gain the access to 
pathway repositories previously not considered. 

Our results highlight that the hierarchical pathway 
structure and the reduced dimension of the pathways 
characterizing respectively the Reactome and Biocarta 
databases jointly with the specialized cancer pathways of 



the NCI databases allow the user to have deeper insight 
into the data. 

To highlight the usefulness of topological analysis in 
the context of transcriptomic data interpretation, we 
report two graphite networks identified as signifi- 
cantly altered in the previous analysis. 

Chronic myeloid leukemia pathway includes both 
genes, BCR and ABL1, and was identified as differen- 
tially expressed between BCR/ ABL positive and negative 
patients by topologyGSA. Figure 7 shows the chronic 
myeloid leukemia graphite network from KEGG 
database with differentially expressed genes mapped 
with different colors according to fold change sign. It is 
interesting to note the presence of several OR groups (e. 
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Table 4 Pathway analysis performed using topologyGSA statistical test on graphite networks. 



Name FDR Database BCR ABL 
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g. PI3K, AKT, IKK, CBL gene families), single members 
of which resulted to be differentially expressed. Two 
clear deregulated paths starting from BCR and ABL1 
genes towards apoptosis and NFKB pathways highlight 
the power of topological analysis to deeper investigate 
signal cascades within large pathways. 

3.6 R/Bioconductor packages 

Currently there are two Bioconductor packages that try 
to convert pathway to the SIF model. KEGGgraph[9] 



that parses KGML format but i) considers all group of 
genes as groups OR and ii) completely removes com- 
pounds without propagation and NCIgraph that 
imports BioPax models from Cytoscape [27] without 
taking into account groups and compound propagation 
(compounds become nodes of the network) and uses 
internal IDs as node labels. The package allows to retain 
only nodes with EntrezGene IDs loosing all the other 
nodes. Thus, both of them are not suitable for topologi- 
cal pathway analyses. 
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4 Conclusions 

It is evident that gene set analysis is moving towards con- 
sidering pathway topology as a crucial feature. A correct 
conversion of the pathway topology to a gene network 
becomes therefore important. Available packages are not 
able to correctly reconstruct the signal transduction in 
most cases, graphite, on the other hand, is an innovative 
package able to gather and make easily available the con- 
tents of the four major pathway databases. In the field of 
topological analysis graphite acts as a provider of biolo- 
gical information by reducing the pathway complexity con- 
sidering the biological meaning of the pathway elements. 

5 Availability and requirements 

♦ Project name: graphite 

♦ Project home page: http://www.bioconductor.org/ 
packages/devel/bioc/html/graphite.html ♦ Operating sys- 
tem^): Platform independent 

♦ Programming language: R 

♦ Other requirements: Bioconductor 

♦ License: GNU AGPL 

♦ Any restrictions to use by non-academics: none 
Additional material 



Additional file 1: Example of KGML and owl. Additional file 1: bmc 
supp.pdf, 116 K. http://www.biomedcentral.com/imedia/ 
1 501 53797661 3594/supp1 .pdf 
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